Part 2. Summarizing distributions (part a)
Suppose we have a random variable \(X\).
We know its PMF/PDF \(f(x)\) and CDF \(F(x)\).
How can we summarize this distribution?
For discrete R.V. with probability mass function (PMF) \(f(x)\), the expected value of \(X\) is
\[{\textrm E}\,[X] = \sum_x x f(x) \]
Equivalently, we could write the sum over the support of \(X\):
\[{\textrm E}\,[X] = \sum_{x \in \text{Supp}[X]} x f(x) \]
| \(x\) | \(f(x)\) |
|---|---|
| 0 | .2 |
| 1 | .5 |
| 3 | .3 |
\[ f(x) = \begin{cases} \, .2 & x = 0 \\ .5 & x = 1 \\ .3 & x = 3 \\ 0 & \text{otherwise} \end{cases} \]
What is \({\textrm E}\,[X] = \sum_x x f(x)\)?
\[\begin{aligned} {\textrm E}\,[X] &= 0 \times .2 + 1 \times .5 + 3 \times .3 \\ &= 1.4 \end{aligned}\]
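This calculation is easy to check in R (a minimal sketch; the vectors simply transcribe the table above):

```r
# PMF from the table above
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

# E[X]: sum of each value times its probability
sum(x * fx)
# [1] 1.4
```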
You know about taking the average or mean of a set of numbers \(x_1, x_2, \ldots, x_n\):
\[ \overline{x} = \frac{1}{n} \sum_{i = 1}^n x_i \]
Just as a probability is a long-run frequency, an expectation is a long-run average.
Given PMF:
\[ f(x) = \begin{cases} \, .2 & x = 0 \\ .5 & x = 1 \\ .3 & x = 3 \\ 0 & \text{otherwise} \end{cases} \]
Then \({\textrm E}\,[X] = 0 \times .2 + 1 \times .5 + 3 \times .3 = 1.4\).
Alternative method: make a vector of length \(n\) where each \(x\) appears \(n f(x)\) times, then take the average of that vector.
Why does this work?
If each unique \(x\) appears \(n f(x)\) times, then
\[\begin{aligned} \overbrace{\frac{1}{n} \sum_i x_i}^{\text{Average}} &= \frac{1}{n} \sum_x x n f(x) \\ &= \frac{n}{n} \sum_x x f(x) = {\textrm E}\,[X] \end{aligned}\]
This may clarify why, in any given sample, \(\overline{x} \neq {\textrm E}\,[X]\): the observed frequencies rarely match \(f(x)\) exactly.
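A sketch of this construction in R, using the same three-point PMF with \(n = 1000\), so each value appears exactly \(n f(x)\) times:

```r
n  <- 1000
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)

# Repeat each value x exactly n * f(x) times: 200 zeros, 500 ones, 300 threes
xs <- rep(x, times = n * fx)

mean(xs)
# [1] 1.4
```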
For continuous R.V. \(X\),
\[ {\textrm E}\,[X] = \int_{-\infty}^{\infty} x f(x) \, dx \]
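As a numerical sketch (the exponential distribution here is just an illustrative choice): for \(X \sim \text{Exponential}(\text{rate} = 2)\) the mean is \(1/2\), and R's integrate() approximates \(\int x f(x)\,dx\):

```r
# E[X] for an exponential RV with rate 2 (true value: 1/2)
integrate(function(x) x * dexp(x, rate = 2), lower = 0, upper = Inf)
# ≈ 0.5
```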
For R.V. \(X\), consider \((X - c)^2\) for some constant \(c\). (A function of a random variable.)
Define mean squared error of \(X\) about \(c\) as \({\textrm E}\,[(X - c)^2]\).
For \(c=1\), we have:
| \(x\) | \(f(x)\) | \((x - 1)^2\) |
|---|---|---|
| 0 | .2 | 1 |
| 1 | .5 | 0 |
| 3 | .3 | 4 |
So MSE of \(X\) about \(1\) is:
\[ .2 \times 1 + .5 \times 0 + .3 \times 4 = 1.4 \]
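Checking in R (same table vectors as before; c0 stands in for the constant \(c\)):

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)
c0 <- 1   # the constant c

# MSE of X about c: expected squared deviation
sum((x - c0)^2 * fx)
# [1] 1.4
```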
\({\textrm E}\,[X]\) is the choice of \(c\) that minimizes MSE. (Wait for proof.)
Suppose we place a weight \(f(x)\) at each value \(x \in \text{Supp}(X)\) along a weightless rod.
Where is the center of mass, i.e. point where rod balances?
It is the point \(c\) where \(\sum_x (x - c) f(x) = 0\).
That point is \({\textrm E}\,[X]\).
Proof:
\[\begin{aligned} \sum_x (x - E[X]) f(x) &= \sum_x \left( x f(x) - E[X] f(x) \right) \\ &= \sum_x x f(x) - \sum_x E[X] f(x) \\ &= E[X] - E[X] \sum_x f(x) \\ &= E[X] - E[X] \\ &= 0 \end{aligned}\]
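A quick numerical check of the balance point, again with the three-point PMF:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)
EX <- sum(x * fx)   # 1.4

# Weighted deviations about E[X] sum to zero (up to floating-point error)
sum((x - EX) * fx)
```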
Consider this joint PMF \(f(x, y)\) for \(X\) and \(Y\) (e.g. state 1 militarizes, state 2 militarizes)
| \(x\) | \(y\) | \(f(x,y)\) |
|---|---|---|
| 0 | 0 | 1/10 |
| 0 | 1 | 1/5 |
| 1 | 0 | 1/5 |
| 1 | 1 | 1/2 |
\[ f(x,y) = \begin{cases} \, 1/10 & x = 0, y = 0 \\ 1/5 & x = 0, y = 1 \\ 1/5 & x = 1, y = 0 \\ 1/2 & x = 1, y = 1 \\ 0 & \text{otherwise} \end{cases} \]
What is \({\textrm E}\,[XY]\)? (We will need this later, e.g. for covariance.)
\[\begin{aligned} {\textrm E}\,[XY] &\equiv \sum_x \sum_y xy f(x, y) \\ &= 0 \times 1/10 + 0 \times 1/5 + 0 \times 1/5 + 1 \times 1/2 \\ &= 1/2 \end{aligned}\]
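In R, transcribing the joint PMF table:

```r
# Joint PMF from the table above
x <- c(0, 0, 1, 1)
y <- c(0, 1, 0, 1)
f <- c(1/10, 1/5, 1/5, 1/2)

# E[XY]: sum of xy times the joint probability
sum(x * y * f)
# [1] 0.5
```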
Let \(X\) and \(Y\) be RVs. Then \(\forall a, b, c \in \mathbb{R}\), \({\textrm E}\,[aX + bY + c] = a{\textrm E}\,[X] + b{\textrm E}\,[Y] + c\)
Proof (discrete case):
\[\begin{align} {\textrm E}\,[aX + bY + c] &= \sum_x \sum_y (ax + by + c) f(x,y) \\ &= \sum_x \sum_y ax f(x,y) + \sum_x \sum_y by f(x,y) + \sum_x \sum_y c f(x,y) \\ &= a \sum_x \sum_y x f(x,y) + b \sum_x \sum_y y f(x,y) + c \sum_x \sum_y f(x,y) \\ &= a \sum_x x \sum_y f(x,y) + b \sum_y y \sum_x f(x,y) + c \sum_x \sum_y f(x,y) \\ &= a \sum_x x f_X(x) + b \sum_y y f_Y(y) + c \sum_x f_X(x)\\ &= a {\textrm E}\,[X] + b {\textrm E}\,[Y] + c \end{align}\]
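A numerical check of linearity on the joint PMF above (the choices \(a = 2\), \(b = 3\), \(c = 1\) are arbitrary):

```r
x <- c(0, 0, 1, 1)
y <- c(0, 1, 0, 1)
f <- c(1/10, 1/5, 1/5, 1/2)
a <- 2; b <- 3; c0 <- 1   # arbitrary constants

EX  <- sum(x * f)                     # marginal E[X]
EY  <- sum(y * f)                     # marginal E[Y]
lhs <- sum((a * x + b * y + c0) * f)  # E[aX + bY + c] directly
rhs <- a * EX + b * EY + c0           # by linearity

c(lhs, rhs)
# [1] 4.5 4.5
```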
Consider this code:

```r
samp <- sample(x = c("a", "b", "c"),
               size = 1000,
               replace = TRUE,
               prob = c(.1, .3, .6))

## R help file says:
# sample takes a sample of the specified size from the elements of x
# either with or without replacement.

## function arguments:
# x: either a vector of one or more elements from which to choose,
#    or a positive integer.
# size: a non-negative integer giving the number of items to choose.
# replace: should sampling be with replacement?
# prob: a vector of probability weights for obtaining the elements
#       of the vector being sampled.
```
```r
tens <- rep(10, 1000)
```

What would the output of mean(samp == "a") be (approximately)?
Answer: It should be about .1, the probability of drawing an “a”.
What would the output of sum(tens[samp == "b"]) be (approximately)?
Answer: It should be about \(10 \times 300 = 3000\). (About 300 of the entries in samp should be “b”, so tens[samp == "b"] should be a vector of about 300 10s, which sum to about 3000.)
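Running the code confirms both answers (the seed is arbitrary, chosen only for reproducibility):

```r
set.seed(42)
samp <- sample(x = c("a", "b", "c"), size = 1000,
               replace = TRUE, prob = c(.1, .3, .6))
tens <- rep(10, 1000)

mean(samp == "a")        # ≈ .1
sum(tens[samp == "b"])   # ≈ 3000
```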
The variance of \(X\) is defined as
\[{\textrm V}\,[X] \equiv {\textrm E}\,[(X - {\textrm E}\,[X])^2]\]
For a Bernoulli RV (\(x = 1\) might mean heads, or revolution):
\[ f(x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \\ 0 & \text{otherwise} \end{cases} \]
We can compute \((X - {\textrm E}\,[X])^2\) at each \(x\):
| \(x\) | \(f(x)\) | \((x - {\textrm E}\,[X])^2\) |
|---|---|---|
| 0 | \(\color{green}{1-p}\) | \(\color{red}{p^2}\) |
| 1 | \(\color{blue}{p}\) | \(\color{orange}{1 - 2p + p^2}\) |
And then variance as \({\textrm E}\,[(X - {\textrm E}\,[X])^2]\):
\[\begin{aligned} {\textrm V}\,[X] &= {\textrm E}\,[(X - {\textrm E}\,[X])^2] \\ &= \color{red}{p^2}\color{green}{(1-p)} + \color{orange}{(1 - 2p + p^2)}\color{blue}{p} \\ &= p^2 - p^3 + p - 2p^2 + p^3 \\ &= p - p^2 \\ &= p(1 - p)\end{aligned}\]
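A numerical check for one value of \(p\) (0.3 is an arbitrary choice):

```r
p  <- 0.3   # arbitrary choice of p
x  <- c(0, 1)
fx <- c(1 - p, p)
EX <- sum(x * fx)   # = p

sum((x - EX)^2 * fx)   # V[X] by definition
p * (1 - p)            # closed form; both are 0.21
```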
Bernoulli example again:
\[ f(x) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \\ 0 & \text{otherwise} \end{cases} \]
Alternative formulation for variance:
\[{\textrm V}\,[X] = {\textrm E}\,[X^2] - {\textrm E}\,[X]^2\]
What is \({\textrm E}\,[X]\)? What is \({\textrm E}\,[X^2]\)? Both equal \(p\): since \(X \in \{0, 1\}\), we have \(X^2 = X\).
By this alternative formula, we then have
\[{\textrm V}\,[X] = p - p^2 = p(1-p)\]
Proof of the alternative formula:
\[\begin{align} {\textrm V}\,[X] &= {\textrm E}\,\left[(X - {\textrm E}\,[X])^2\right] \\ &= {\textrm E}\,\left[X^2 - \color{blue}{2{\textrm E}\,[X]} X + {\textrm E}\,[X]^2\right] \\ &= {\textrm E}\,[X^2] - {\textrm E}\,\left[\color{blue}{2{\textrm E}\,[X]} X\right] + {\textrm E}\,\left[{\textrm E}\,[X]^2\right] \\ &= {\textrm E}\,[X^2] - \color{blue}{2{\textrm E}\,[X]} {\textrm E}\,[X] + {\textrm E}\,[X]^2 \\ &= {\textrm E}\,[X^2] - 2 {\textrm E}\,[X]^2 + {\textrm E}\,[X]^2 \\ &= {\textrm E}\,[X^2] - {\textrm E}\,[X]^2 \end{align}\]
For a random variable \(X\) and any constant \(c\), adding the constant shifts the distribution but does not change its spread: \({\textrm V}\,[X + c] = {\textrm V}\,[X]\).
Proof:
\[\begin{align} {\textrm V}\,[X + c] &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X + c])^2\right] \\ &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X] - {\textrm E}\,[c])^2\right] \\ &= {\textrm E}\,\left[(X + c - {\textrm E}\,[X] - c)^2\right] \\ &= {\textrm E}\,\left[(X - {\textrm E}\,[X])^2\right] \\ &= {\textrm V}\,[X] \end{align}\]
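Numerically, with the three-point PMF from earlier and an arbitrary shift:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)
c0 <- 5   # arbitrary constant

# Variance of a PMF placing weights fx on values v
vpmf <- function(v) sum((v - sum(v * fx))^2 * fx)

c(vpmf(x), vpmf(x + c0))   # identical: shifting does not change the spread
# [1] 1.24 1.24
```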
Standard deviation:
\[\sigma[X] = \sqrt{{\textrm V}\,[X]}\]
Roughly, “how far does \(X\) tend to be from its mean”?
Alternative formula for MSE:
\[\begin{align} {\textrm E}\,[(X - c)^2] &= {\textrm E}\,\left[X^2 - 2cX + c^2\right] \\ &= {\textrm E}\,[X^2] - 2c{\textrm E}\,[X] + c^2 \\ &= {\textrm E}\,[X^2] - \color{red}{{\textrm E}\,[X]^2} + \color{green}{{\textrm E}\,[X]^2} - 2c{\textrm E}\,[X] + c^2 \\ &= \left({\textrm E}\,[X^2] - \color{red}{{\textrm E}\,[X]^2}\right) + \left(\color{green}{{\textrm E}\,[X]^2} - 2c{\textrm E}\,[X] + c^2\right) \\ &= {\textrm V}\,[X] + \left({\textrm E}\,[X] - c\right)^2 \end{align}\]
So what \(c\) should you choose to minimize MSE? Only the second term, \(({\textrm E}\,[X] - c)^2\), depends on \(c\), and it is zero at \(c = {\textrm E}\,[X]\). This is the promised proof that \({\textrm E}\,[X]\) minimizes MSE.
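A sketch verifying the decomposition and the minimizer on the running example:

```r
x  <- c(0, 1, 3)
fx <- c(.2, .5, .3)
EX <- sum(x * fx)
VX <- sum((x - EX)^2 * fx)

mse <- function(c0) sum((x - c0)^2 * fx)

# Decomposition holds at, e.g., c = 2
all.equal(mse(2), VX + (EX - 2)^2)

# MSE over a grid of c values is smallest at c = E[X] = 1.4
grid <- seq(0, 3, by = .01)
grid[which.min(sapply(grid, mse))]
# [1] 1.4
```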
Given any RV \(X\) with associated PMF/PDF or CDF, we can compute \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\).
For special types of RV, these parameters define the distribution entirely (e.g. a normal distribution is fully specified by its mean \(\mu\) and variance \(\sigma^2\)).
But don’t get confused: any RV has \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\), not just these special ones.
Sometimes “mean” means \({\textrm E}\,[X]\) (e.g. mean squared error or \(\mu\) of a normal distribution), sometimes it means “sample mean” or “average” of some numbers (e.g. mean(c(2,4,6))).
Sometimes “variance” means \({\textrm V}\,[X]\), sometimes it means “sample variance” (e.g. var(c(2,4,6))).
There is a close relationship, but remember that mean() and var() are R functions that convert a vector of numbers (e.g. a sample) into a number, whereas \({\textrm E}\,[X]\) and \({\textrm V}\,[X]\) are properties of the distribution of a random variable.
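For instance, on the vector above, note that var() divides by \(n - 1\), not \(n\):

```r
v <- c(2, 4, 6)

mean(v)   # sample mean: 4
var(v)    # sample variance, divides by n - 1: (4 + 0 + 4) / 2 = 4

# A "population"-style variance would divide by n instead
sum((v - mean(v))^2) / length(v)   # 8/3 ≈ 2.67
```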